INTERSPEECH 2014 - Language and Multimodal

Total: 114

#1 Theme identification in human-human conversations with features from specific speaker type hidden spaces

Authors: Mohamed Morchid ; Richard Dufour ; Mohamed Bouallegue ; Georges Linarès ; Renato De Mori

This paper describes research on theme identification in real-world customer service telephone conversations between an agent and a customer. Separate hidden spaces are considered for agents, customers, and the combination of the two. The purpose is to separate semantic constituents from the speaker types and their possible relations. Probabilities of hidden topic features are then used by separate Gaussian classifiers to compute theme probabilities for each speaker type. A simple strategy that does not require any additional parameter estimation is introduced to classify themes with confidence indicators for each theme hypothesis. Experimental results on a real-life application show that features from speaker-type-specific hidden spaces capture useful semantic content, with significantly superior performance with respect to independent word-based features or a single set of features. Experimental results also show that the proposed strategy makes it possible to perform surveys on collections of conversations by automatically selecting processed samples with high theme identification accuracy.

#2 Learning phrase patterns for text classification using a knowledge graph and unlabeled data

Authors: Alex Marin ; Roman Holenstein ; Ruhi Sarikaya ; Mari Ostendorf

This paper explores a novel method for learning phrase pattern features for text classification, employing a mapping of selected words into a knowledge graph and self-training over unlabeled data. Using Support Vector Machine classification, we obtain improvements over lexical and fully-supervised phrase pattern features in domain and intent detection for language understanding, particularly in conjunction with the use of unlabeled data. Our best results are obtained using unlabeled data filtered for both model training and feature learning based on the confidence of the baseline classifiers.
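
A minimal sketch of the confidence-based filtering of unlabeled data described above, assuming scikit-learn, a plain bag-of-words representation, and an illustrative margin threshold; the knowledge-graph phrase-pattern features from the paper are not reproduced here.

```python
# Self-training with confidence filtering: pseudo-label unlabeled texts with a
# baseline SVM and retrain only on the confidently labeled additions.
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

labeled_texts = ["play some jazz", "what is the weather", "call my mother"]
labels = ["music", "weather", "phone"]
unlabeled_texts = ["turn on the radio", "will it rain tomorrow"]

vec = TfidfVectorizer()
X = vec.fit_transform(labeled_texts + unlabeled_texts)
X_lab, X_unlab = X[:len(labeled_texts)], X[len(labeled_texts):]

# Train the baseline classifier on labeled data only.
clf = LinearSVC()
clf.fit(X_lab, labels)

# Keep only unlabeled examples the baseline classifies confidently
# (largest decision-function score above a hypothetical threshold).
scores = clf.decision_function(X_unlab)
margins = scores.max(axis=1) if scores.ndim > 1 else np.abs(scores)
confident = margins > 0.2          # illustrative threshold
pseudo_labels = clf.predict(X_unlab)

# Retrain with the confidently self-labeled examples added.
X_aug = sp.vstack([X_lab, X_unlab[confident]])
y_aug = list(labels) + list(pseudo_labels[confident])
clf.fit(X_aug, y_aug)
print(clf.predict(vec.transform(["switch on some music"])))
```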

#3 Targeted feature dropout for robust slot filling in natural language understanding

Authors: Puyang Xu ; Ruhi Sarikaya

In slot filling with conditional random field (CRF), the strong current word and dictionary features tend to swamp the effect of contextual features, a phenomenon also known as feature undertraining. This is a dangerous tradeoff especially when training data is small and dictionaries are limited in their coverage of the entities observed during testing. In this paper, we propose a simple and effective solution that extends the feature dropout algorithm, directly aiming at boosting the contribution from entity context. We show with extensive experiments that the proposed technique can significantly improve the robustness against unseen entities, without degrading performance on entities that are either seen or exist in the dictionary.
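
A minimal sketch of targeted feature dropout, assuming sklearn-crfsuite-style token feature dictionaries; the feature names, dropout rate, and example sentence are illustrative, not the paper's exact configuration.

```python
# Randomly drop the strong current-word and dictionary features at training
# time, but only at targeted (entity) positions, so the CRF must also learn
# from the contextual features.
import random

def token_features(tokens, i, in_dictionary):
    return {
        "word": tokens[i],                          # strong current-word feature
        "in_dict": in_dictionary(tokens[i]),        # strong dictionary feature
        "prev_word": tokens[i - 1] if i > 0 else "<s>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "</s>",
    }

def targeted_dropout(feats, p=0.3, targets=("word", "in_dict")):
    """Randomly remove the strong features; applied only during training."""
    out = dict(feats)
    for name in targets:
        if name in out and random.random() < p:
            del out[name]
    return out

# Example: apply dropout only at entity positions when building training data.
sentence = ["book", "a", "table", "at", "olive", "garden"]
is_entity = [False, False, False, False, True, True]
in_dict = lambda w: w in {"olive", "garden"}        # toy gazetteer lookup

train_feats = [targeted_dropout(token_features(sentence, i, in_dict))
               if is_entity[i] else token_features(sentence, i, in_dict)
               for i in range(len(sentence))]
print(train_feats)
```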

#4 Spoken question answering using tree-structured conditional random fields and two-layer random walk

Authors: Sz-Rung Shiang ; Hung-yi Lee ; Lin-shan Lee

In this paper, we consider a spoken question answering (QA) task in which the questions are in the form of speech, while the knowledge source for answers is webpages (in text) on the Internet, accessed by an information retrieval engine; we mainly focus on the query formulation and re-ranking parts. Because the recognition results for the spoken questions are less reliable, we use N-best lists in order to have a higher probability of capturing the correct keywords for the questions, but more noisy words are inevitably included as well. We therefore propose a hierarchical labeling method using tree-structured conditional random fields (CRF) to leverage the parse tree information, or syntactic structure, obtained from the N-best lists of the spoken questions, such that the queries for information retrieval can be better formulated. In addition, because queries formulated from the N-best results naturally introduce more noisy information, we further propose a two-layer random walk for re-ranking the retrieved webpages to produce better documents containing answers. Initial experiments performed on a set of question answering pairs in Mandarin Chinese verified that improved performance was achievable with the proposed approaches.

#5 Shrinkage based features for slot tagging with conditional random fields

Authors: Ruhi Sarikaya ; Asli Celikyilmaz ; Anoop Deoras ; Minwoo Jeong

In this paper we propose a set of class-based features that are generated in an unsupervised fashion to improve slot tagging with Conditional Random Fields (CRFs). The feature generation is based on the idea behind shrinkage-based language models, where shrinking the sum of parameter magnitudes in an exponential model tends to improve performance. We use these features with CRFs and show that they consistently improve slot tagging performance over baselines on several natural language understanding tasks. Since the proposed features are generated in an unsupervised manner without significant computational overhead, the improvements in performance come for free, and we expect that the same features may result in gains in other tagging tasks.

#6 Cluster based Chinese abbreviation modeling

Authors: Yangyang Shi ; Yi-Cheng Pan ; Mei-Yuh Hwang

Abbreviations are widely observed in spoken Chinese. Automatic generation of Chinese abbreviations helps to improve Chinese natural language understanding systems and Chinese search engines. Abbreviation generation is treated as a character-based tagging problem. Due to limited training data, Chinese abbreviation generation suffers from data sparseness. Two types of strategies are proposed to reduce the impact of data sparseness. First of all, in addition to using a traditional sequence labelling method, Conditional Random Fields (CRF), we propose to apply a Recurrent Neural Network with Maximum Entropy Extension (RNNME), which shows performance similar to the CRF in our experiments. Secondly, we propose to use training data clustering and latent topic modeling in abbreviation generation. Using training data clustering or topic modeling not only addresses the data sparseness, but also takes advantage of the fact that full names from the same cluster or the same latent topic have similar abbreviation patterns. Our experimental results show that with manual clustering, the accuracy of abbreviation generation achieves a relative 8% improvement. Using latent topics obtained from Latent Dirichlet Allocation (LDA), the accuracy achieves a relative 10% improvement.
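
A toy sketch of the character-based tagging formulation with a cluster feature: each character of a full name is labeled Keep (K) or Drop (D), and the cluster id of the full name is attached to every character. The name, labels, and cluster id are invented, and the actual CRF/RNNME tagger is omitted.

```python
# Character tagging for abbreviation generation: features per character plus
# a cluster id, so full names in the same cluster share abbreviation patterns.
full_name = "北京大学"           # abbreviated as "北大"
keep_mask = [1, 0, 1, 0]         # K D K D -> "北大"
cluster_id = 3                   # hypothetical cluster of similar full names

def char_features(name, i, cluster):
    return {
        "char": name[i],
        "prev": name[i - 1] if i > 0 else "<s>",
        "next": name[i + 1] if i < len(name) - 1 else "</s>",
        "pos": i,
        "cluster": cluster,      # cluster membership as a tagging feature
    }

X = [char_features(full_name, i, cluster_id) for i in range(len(full_name))]
y = ["K" if k else "D" for k in keep_mask]
abbreviation = "".join(c for c, k in zip(full_name, keep_mask) if k)
print(y, abbreviation)
```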

#7 Parsing named entity as syntactic structure

Authors: Xiantao Zhang ; Dongchen Li ; Xihong Wu

Named entity recognition (NER) plays an important role in many natural language processing applications. This paper presents a novel approach to Chinese NER that differs from most previous approaches in three main respects. First of all, while previous work is good at modeling features between observation elements, our model incorporates syntactic structure as higher-level information, which is crucial for recognizing long named entities, one of the main difficulties of NER. Secondly, NER and syntactic analysis have until now been modeled separately in natural language processing; we integrate them in a unified framework, which allows the information from each type of annotation to improve performance on the other and produces consistent output. Finally, few studies have been reported on the recognition of nested named entities in Chinese. This paper presents a structured prediction model for Chinese nested named entity recognition, implemented through a joint representation of syntactic and named entity structures. We provide empirical evidence that the parsing model can utilize syntactic constraints for recognizing named entities and exploit the composition patterns of named entities. Experimental results demonstrate the mutual benefits for each task, and the model outputs the syntactic structure of named entities.

#8 Detecting out-of-domain utterances addressed to a virtual personal assistant

Authors: Gokhan Tur ; Anoop Deoras ; Dilek Hakkani-Tür

#9 Fusion of knowledge-based and data-driven approaches to grammar induction

Authors: Spiros Georgiladakis ; Christina Unger ; Elias Iosif ; Sebastian Walter ; Philipp Cimiano ; Euripides Petrakis ; Alexandros Potamianos

Using different sources of information for grammar induction results in grammars that vary in coverage and precision. Fusing such grammars with a strategy that exploits their strengths while minimizing their weaknesses is expected to produce grammars with superior performance. We focus on the fusion of grammars produced using a knowledge-based approach using lexicalized ontologies and a data-driven approach using semantic similarity clustering. We propose various algorithms for finding the mapping between the (non-terminal) rules generated by each grammar induction algorithm, followed by rule fusion. Three fusion approaches are investigated: early, mid and late fusion. Results show that late fusion provides the best relative F-measure performance improvement by 20%.

#10 Improving named entity recognition with prosodic features

Authors: Denys Katerenchuk ; Andrew Rosenberg

In natural language processing (NLP) the problem of named entity (NE) recognition in speech is well known, yet it remains a challenge whose performance depends on automatic speech recognition (ASR) error rates. NEs are often foreign or out-of-vocabulary (OOV) words, leaving conventional ASR systems unable to recognize them. In our research, we improve a CRF-based NE recognition system by incorporating two styles of prosodic features: hypothesized ToBI labels and unsupervised clusters of acoustic features. ToBI-based features improve NE recognition by 6% absolute (F1: 0.39 vs. 0.45) on automatically recognized spontaneous speech from ACE'05.
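
A minimal sketch of the unsupervised acoustic-cluster style of prosodic feature, assuming scikit-learn and per-token prosodic vectors (e.g. pitch and energy statistics) already aligned to words; the numbers are toy data and the CRF tagger itself is omitted.

```python
# Cluster per-token prosodic vectors and attach the cluster id as an extra
# token feature alongside the lexical features.
import numpy as np
from sklearn.cluster import KMeans

tokens = ["barack", "obama", "visited", "paris"]
# Hypothetical per-token prosodic vectors: [mean_f0, f0_range, energy]
acoustic = np.array([[210.0, 55.0, 0.8],
                     [205.0, 60.0, 0.9],
                     [150.0, 20.0, 0.4],
                     [220.0, 50.0, 0.7]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(acoustic)

def features(i):
    return {
        "word": tokens[i],
        "prosody_cluster": int(km.labels_[i]),   # unsupervised prosodic class id
    }

print([features(i) for i in range(len(tokens))])
```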

#11 Neural network models for lexical addressee detection

Authors: Suman V. Ravuri ; Andreas Stolcke

Addressee detection for dialog systems aims to detect which utterances are directed at the system, as opposed to someone else. An important means for classification is the lexical content of the utterance, and N-gram models have been shown to be effective for this task. In this paper we investigate whether neural networks can enhance lexical addressee detection, using data from a human-human-computer dialog system. Even though we find no improvement from simply replacing the standard N-gram LM with a neural-network LM as class likelihood estimators, improved classification accuracy can be obtained from a modified neural net model that learns distributed word representations in a first training phase, and is trained on the utterance classification task in a second phase. We obtain additional gains by combining the class likelihood estimation and classification training criteria in the second phase, and by combining multiple model architectures at the score level. Overall, we achieve over 2% absolute reduction in equal error rate over the N-gram model baseline of 27%.
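
A schematic sketch of the two-phase idea, assuming scikit-learn: a truncated SVD of a word co-occurrence matrix stands in for the neural pre-training of distributed word representations, and a logistic-regression classifier over averaged word vectors stands in for the second, utterance-classification phase. This is not the paper's neural architecture, and the utterances and labels are toy data.

```python
# Phase 1: learn low-dimensional word vectors; Phase 2: classify utterances
# as system-addressed or not from averaged word vectors.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

utts = ["computer play music", "yeah I think so",
        "computer what time is it", "no you are right"]
labels = [1, 0, 1, 0]            # 1 = addressed to the system

vocab = sorted({w for u in utts for w in u.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Phase 1: within-utterance co-occurrence counts, factorized to low rank.
C = np.zeros((len(vocab), len(vocab)))
for u in utts:
    ws = u.split()
    for a in ws:
        for b in ws:
            if a != b:
                C[idx[a], idx[b]] += 1.0
emb = TruncatedSVD(n_components=3, random_state=0).fit_transform(C)

# Phase 2: classify utterances from averaged word embeddings.
X = np.vstack([emb[[idx[w] for w in u.split()]].mean(axis=0) for u in utts])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```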

#12 Manipulating stance and involvement using collaborative tasks: an exploratory comparison

Authors: Valerie Freeman ; Julian Chan ; Gina-Anne Levow ; Richard Wright ; Mari Ostendorf ; Victoria Zayats

The ATAROS project aims to identify acoustic signals of stance-taking in order to inform the development of automatic stance recognition in natural speech. Due to the typically low frequency of stance-taking in existing corpora that have been used to investigate related phenomena such as subjectivity, we are creating an audio corpus of unscripted conversations between dyads as they complete collaborative tasks designed to elicit a high density of stance-taking at increasing levels of involvement. To validate our experimental design and provide a preliminary assessment of the corpus, we examine a fully transcribed and time-aligned portion to compare the speaking styles in two tasks, one expected to elicit low involvement and weak stances, the other high involvement and strong stances. We find that although overall measures such as task duration and total word count do not indicate consistent differences across tasks, speakers do display significant differences in speaking style. Factors such as increases in speaking rate, turn length, and disfluencies from weak- to strong-stance tasks are consistent with increased involvement by the participants and provide evidence in support of the experimental design.

#13 The INTERSPEECH 2014 computational paralinguistics challenge: cognitive & physical load

Authors: Björn Schuller ; Stefan Steidl ; Anton Batliner ; Julien Epps ; Florian Eyben ; Fabien Ringeval ; Erik Marchi ; Yue Zhang

The INTERSPEECH 2014 Computational Paralinguistics Challenge provides for the first time a unified test-bed for the automatic recognition of speakers' cognitive and physical load in speech. In this paper, we describe these two Sub-Challenges, their conditions, baseline results and experimental procedures, as well as the ComParE baseline features generated with the openSMILE toolkit and provided to the participants in the Challenge.

#14 Filtering and subspace selection for spectral features in detecting speech under physical stress

Authors: Jouni Pohjalainen ; Paavo Alku

This paper investigates approaches to modeling the time evolution of short-time spectral features in paralinguistic speech type classification, where we focus on the detection of speech influenced by physical exertion. The time series model consists of autoregressive processes of multiple time scales and orders and is trained to describe the long-term dynamics of a given target speech class. The model is applied in two ways to improve long-term modeling in the detection task: 1) to perform predictive filtering of the features and 2) to automatically select instantaneous classification subspaces. The spectrum analysis method underlying the short-time features is also varied between the standard discrete Fourier transform and a time-weighted linear predictive method which yields smooth all-pole spectrum envelope models. Configurations of the proposed methods are evaluated in the Physical Load task of the Interspeech 2014 Computational Paralinguistics Challenge and show improvement over the baseline timbral classifier and the challenge baseline. The interrelationships among the methods are also discussed.

#15 Automatic recognition of speaker physical load using posterior probability based features from acoustic and phonetic tokens

Author: Ming Li

This paper presents an automatic speaker physical load recognition approach using posterior probability based features from acoustic and phonetic tokens. In this method, the tokens for calculating the posterior probabilities, or zero-order statistics, are extended from the conventional MFCC-trained Gaussian Mixture Model (GMM) components to parallel phonetic phonemes and tandem-feature-trained GMM components. Phoneme recognizers from five different languages are employed to extract the phoneme posterior probabilities. We show that these histogram-style features at both the acoustic and phonetic levels are effective and complementary for capturing speaker physical load information from short utterances. A support vector machine is adopted as the supervised classifier. Combining the proposed methods with the openSMILE baseline, which covers acoustic and prosodic information, further improves the final performance. The proposed fusion system achieves 70.18% and 72.81% unweighted accuracy on the validation and test sets of the Munich Bio-voice Corpus for the binary physical load level recognition task in the INTERSPEECH 2014 Computational Paralinguistics Challenge.
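
A minimal sketch of the posterior-probability (zero-order statistics) features at the acoustic-token level, assuming scikit-learn's GaussianMixture as the tokenizer and random vectors in place of MFCC frames; the phonetic-level tokens from multilingual phoneme recognizers are not reproduced here.

```python
# Average GMM component posteriors over the frames of an utterance to obtain
# a histogram-style feature vector, then classify with an SVM.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(0)
background_frames = rng.normal(size=(500, 13))       # stand-in for MFCC frames
ubm = GaussianMixture(n_components=8, random_state=0).fit(background_frames)

def posterior_histogram(frames):
    """Average component posteriors over all frames of an utterance."""
    post = ubm.predict_proba(frames)                 # (n_frames, n_components)
    return post.mean(axis=0)                         # normalized soft counts

# Two toy utterances per class, classified with a linear SVM.
X = np.vstack([posterior_histogram(rng.normal(loc=m, size=(100, 13)))
               for m in (0.0, 0.0, 0.5, 0.5)])
y = [0, 0, 1, 1]
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict(X))
```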

#16 Canonical correlation analysis and local fisher discriminant analysis based multi-view acoustic feature reduction for physical load prediction

Authors: Heysem Kaya ; Tuğçe Özkaptan ; Albert Ali Salah ; Sadık Fikret Gürgen

In this study we present our system for the INTERSPEECH 2014 Computational Paralinguistics Challenge (ComParE 2014), Physical Load Sub-challenge (PLS). Our contribution is twofold. First, we propose using Low Level Descriptor (LLD) information as hints, so as to partition the feature space into meaningful subsets called views. We also show the virtue of commonly employed feature projections, such as Canonical Correlation Analysis (CCA) and Local Fisher Discriminant Analysis (LFDA), as ranking feature selectors. Results indicate the superiority of the multi-view feature reduction approach over its single-view counterpart. Moreover, the discriminative projection matrices are observed to provide valuable information for feature selection, which generalizes better than the projection itself. In our preliminary experiments we reached 75.35% Unweighted Average Recall (UAR) on the PLS test set, using CCA-based multi-view feature selection.
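
A minimal sketch of using a CCA projection as a ranking feature selector rather than as a projection, assuming scikit-learn: the acoustic features are correlated with a one-hot label view, and the magnitude of the learned weights ranks the original features. The data, label view, and number of retained features are illustrative, and the LLD-based view partitioning is not shown.

```python
# Rank original features by the magnitude of their CCA weights and keep the
# top-ranked ones instead of projecting the data.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))            # one acoustic feature "view"
y = rng.integers(0, 2, size=60)          # binary physical-load labels
Y = np.eye(2)[y]                         # one-hot label matrix as second view

cca = CCA(n_components=1).fit(X, Y)
scores = np.abs(cca.x_weights_[:, 0])    # weight magnitude per original feature
top_k = np.argsort(scores)[::-1][:5]     # keep the 5 best-ranked features
X_selected = X[:, top_k]
print("selected feature indices:", top_k)
```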

#17 Ensemble of machine learning algorithms for cognitive and physical speaker load detection

Authors: How Jing ; Ting-Yao Hu ; Hung-Shin Lee ; Wei-Chen Chen ; Chi-Chun Lee ; Yu Tsao ; Hsin-Min Wang

We present our methods and results from participating in the Interspeech 2014 Computational Paralinguistics ChallengE (ComParE), the goal of which is to detect certain types of speaker load from acoustic features. In total, seven classification models contribute to our final prediction, namely, a neural network with rectified linear units and dropout (ReLUNet), a conditional restricted Boltzmann machine (CRBM), logistic regression (LR), a support vector machine (SVM), Gaussian discriminant analysis (GDA), k-nearest neighbors (KNN), and a random forest (RF). By linearly blending the predictions of these models, we are able to obtain significant improvements over the challenge baseline.
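
A minimal sketch of linearly blending class-posterior predictions from several classifiers, assuming scikit-learn; three stand-in models, synthetic data, and hand-set weights replace the seven systems and the weight tuning used by the authors.

```python
# Weighted linear combination of predict_proba outputs from several models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_tr, y_tr, X_dev, y_dev = X[:150], y[:150], X[150:], y[150:]

models = [LogisticRegression(max_iter=1000),
          KNeighborsClassifier(),
          RandomForestClassifier(random_state=0)]
probs = [m.fit(X_tr, y_tr).predict_proba(X_dev) for m in models]

weights = np.array([0.4, 0.2, 0.4])      # e.g. tuned on a development set
blend = sum(w * p for w, p in zip(weights, probs))
pred = blend.argmax(axis=1)
print("blended accuracy:", (pred == y_dev).mean())
```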

#18 Detecting the intensity of cognitive and physical load using AdaBoost and deep rectifier neural networks

Authors: Gábor Gosztolya ; Tamás Grósz ; Róbert Busa-Fekete ; László Tóth

The Interspeech ComParE 2014 Challenge consists of two machine learning tasks, both of which have quite a small number of examples. Due to our good results in ComParE 2013, we considered AdaBoost a suitable machine learning meta-algorithm for these tasks; in addition, we experimented with Deep Rectifier Neural Networks, which differ from traditional neural networks in that they have several hidden layers and use rectifier neurons as hidden units. With AdaBoost we achieved competitive results, whereas with the neural networks we were able to outperform the baseline SVM scores in both Sub-Challenges.
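
A minimal sketch of the two model families, assuming scikit-learn: AdaBoost and a multi-layer perceptron with rectifier (ReLU) hidden units stand in for the boosted ensembles and Deep Rectifier Neural Networks used in the challenge systems; the data is synthetic and the layer sizes are illustrative.

```python
# AdaBoost ensemble and a several-hidden-layer network with ReLU units.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=30, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
relu_net = MLPClassifier(hidden_layer_sizes=(128, 128, 128),
                         activation="relu", max_iter=500,
                         random_state=0).fit(X, y)
print(ada.score(X, y), relu_net.score(X, y))
```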

#19 High-level speech event analysis for cognitive load classification

Authors: Claude Montacié ; Marie-José Caraty

Cognitive Load (CL) refers to the load imposed on an individual's cognitive system when performing a given task, and is usually associated with the limitations of human working memory. Stress, fatigue, a lower ability to make decisions, and perceptual narrowing are induced by cognitive overload, which occurs when too much information has to be processed. As with many physiological measures, and as a non-intrusive measurement, speech features have been investigated in order to find reliable indicators of CL levels. In this paper, we investigated high-level speech events automatically detected using the CMU-Sphinx toolkit for speech recognition. Temporal events (speech onset latency, event starting time-codes, pause and phone segments) were extracted from the speech transcriptions (phoneme, word, silent pause, filled pause, breathing). Seven audio feature sets related to the speech events were designed and assessed. Three-class SVM classifiers (Low, Medium and High level) were developed and assessed on the CLSE (Cognitive Load with Speech and EGG) database provided for the Interspeech 2014 Cognitive Load Sub-Challenge. These experiments showed an improvement of 1.5% on the Test set compared to the official baseline Unweighted Average Recall (UAR).

#20 On the use of Bhattacharyya based GMM distance and neural net features for identification of cognitive load levels

Authors: Tin Lay Nwe ; Trung Hieu Nguyen ; Bin Ma

This paper presents a method for detecting cognitive load levels from speech. When speech is modulated by different levels of cognitive load, the acoustic characteristics of the speech change. In this paper, we measure the acoustic distance of a stressed utterance from baseline stress-free speech using a GMM-SVM kernel with a Bhattacharyya-based GMM distance. In addition, it is believed that the airflow structure of speech production is nonlinear. This motivates us to investigate better techniques to capture the nonlinear characteristics of stress information in acoustic features. Inspired by the recent success of neural networks for representation learning, we employ a single-hidden-layer feed-forward network with non-linear activation to extract the feature vectors. Furthermore, people react differently to a particular task load, and this inter-speaker difference in stress responses presents a major challenge for stress level detection. We use a bootstrapped training process to learn the stress response of a particular speaker. We perform experiments using the Cognitive Load with Speech and EGG (CLSE) data sets provided for the Cognitive Load Sub-Challenge of the INTERSPEECH 2014 Computational Paralinguistics Challenge. The results show that the system with our proposed strategies performs well on the validation and test sets.
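
A minimal sketch of the closed-form Bhattacharyya distance between two Gaussian components, the building block of a Bhattacharyya-based GMM distance; how components of two GMMs are matched and how the distance enters the GMM-SVM kernel is not reproduced, and the parameters are toy values.

```python
# Bhattacharyya distance between two multivariate Gaussians:
# 1/8 (mu1-mu2)^T S^-1 (mu1-mu2) + 1/2 ln(det S / sqrt(det S1 * det S2)),
# with S = (S1 + S2) / 2.
import numpy as np

def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

mu1, mu2 = np.zeros(3), np.ones(3)
cov1, cov2 = np.eye(3), 2.0 * np.eye(3)
print(bhattacharyya_gaussian(mu1, cov1, mu2, cov2))
```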

#21 Prediction of cognitive load from speech with the VOQAL voice quality toolbox for the interspeech 2014 computational paralinguistics challenge

Author: Mark Huckvale

This paper describes the UCL system for the cognitive load task of the Interspeech 2014 Computational Paralinguistics Challenge. The UCL system evaluates whether additional voice features computed by the VOQAL voice analysis toolbox improve performance over the baseline feature set. 144 different system configurations are evaluated on the development test set, with some systems achieving 100% classification accuracy of cognitive load in the two Stroop sub-tasks. The difficulty of the reading-span sub-task is shown to be caused in part by the duration of the audio material. The performance of the best systems on the test set confirms the importance of building speaker-dependent systems. While the VOQAL-augmented features gave the best performance on the development test set, no benefit was found on the test set.

#22 The UNSW submission to INTERSPEECH 2014 compare cognitive load challenge

Authors: Jia Min Karen Kua ; Vidhyasaharan Sethu ; Phu Le ; Eliathamby Ambikairajah

Speech based cognitive load estimation is a new field of research. Due to this relative `lack of maturity', a single best approach to building cognitive load estimation systems has not been established yet. The primary aim of this submission is to report the performance of various basic utterance level classification frameworks developed using important elements of state-of-the-art speaker recognition systems. This may lead to a suitable basis for future cognitive load estimation systems. As a consequence of being a part of a challenge, it is expected that these frameworks will be compared to a much larger number of alternative approaches than what would otherwise be possible. In keeping with this focused aim, the GMM supervector approaches along with some variants are utilised. The systems outlined in this paper include a frame-level MFCC-GMM system along with utterance level GMM-supervector-SVM, GMM-ivector-SVM and GMM-JFA-SVM systems. The best combined system has an accuracy (UAR) of 66.6% as evaluated on the challenge development set and 63.7% as evaluated on the test set.
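
A minimal sketch of the GMM mean-supervector feature behind the GMM-supervector-SVM system, assuming scikit-learn and relevance-MAP adaptation of the means only; the UBM size, relevance factor, and frame data are illustrative.

```python
# Train a UBM, MAP-adapt its component means toward an utterance's frames,
# and stack the adapted means into a fixed-length supervector.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=4, covariance_type="diag",
                      random_state=0).fit(rng.normal(size=(400, 13)))

def supervector(frames, relevance=16.0):
    post = ubm.predict_proba(frames)                  # (T, C) responsibilities
    n_c = post.sum(axis=0)                            # zero-order statistics
    f_c = post.T @ frames                             # first-order statistics
    alpha = (n_c / (n_c + relevance))[:, None]
    adapted = (alpha * (f_c / np.maximum(n_c[:, None], 1e-8))
               + (1.0 - alpha) * ubm.means_)          # MAP-adapted means
    return adapted.ravel()                            # stacked C*D supervector

utt_frames = rng.normal(loc=0.3, size=(120, 13))
print(supervector(utt_frames).shape)                  # (4 * 13,) = (52,)
```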

#23 Classification of cognitive load from speech using an i-vector framework

Authors: Maarten Van Segbroeck ; Ruchir Travadi ; Colin Vaz ; Jangwon Kim ; Matthew P. Black ; Alexandros Potamianos ; Shrikanth S. Narayanan

The goal in this work is to automatically classify speakers' level of cognitive load (low, medium, high) from a standard battery of reading tasks requiring varying levels of working memory. This is a challenging machine learning problem because of the inherent difficulty in defining/measuring cognitive load and due to intra-/inter-speaker differences in how their effects are manifested in behavioral cues. We experimented with a number of static and dynamic features extracted directly from the audio signal (prosodic, spectral, voice quality) and from automatic speech recognition hypotheses (lexical information, speaking rate). Our approach to classification addressed the wide variability and heterogeneity through speaker normalization and by adopting an i-vector framework that affords a systematic way to factorize the multiple sources of variability.

#24 Efficient GPU-based training of recurrent neural network language models using spliced sentence bunch

Authors: X. Chen ; Y. Wang ; X. Liu ; Mark J. F. Gales ; Philip C. Woodland

Recurrent neural network language models (RNNLMs) are becoming increasingly popular for a range of applications including speech recognition. However, an important issue that limits the quantity of training data that can be used, and hence their possible application areas, is the computational cost of training. A standard approach to handle this problem is to use class-based outputs, allowing systems to be trained on CPUs. This paper describes an alternative approach that allows RNNLMs to be efficiently trained on GPUs. This enables larger quantities of data to be used, and networks with an unclustered, full output layer to be trained. To improve efficiency on GPUs, multiple sentences are “spliced” together for each mini-batch or “bunch” in training. On a large vocabulary conversational telephone speech recognition task, the training time was reduced by a factor of 27 over the standard CPU-based RNNLM toolkit. The use of an unclustered, full output layer also improves perplexity and recognition performance over class-based RNNLMs.
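
A minimal sketch of the sentence-splicing idea behind bunch-mode training: sentences are concatenated end-to-end into a fixed number of parallel streams so every row of each bunch stays full, keeping GPU matrix operations dense. The word-id sequences, end-of-sentence id, bunch size, and balancing heuristic are illustrative; the RNNLM update itself is omitted.

```python
# Splice sentences into parallel streams and emit a dense (bunch_size, T) batch.
import numpy as np

sentences = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10], [11, 12]]
bunch_size = 2                       # number of parallel streams

streams = [[] for _ in range(bunch_size)]
for sent in sentences:
    # Append each sentence (plus an end-of-sentence token 0) to the
    # currently shortest stream so the streams stay balanced in length.
    shortest = min(range(bunch_size), key=lambda i: len(streams[i]))
    streams[shortest].extend(sent + [0])

# Pad the streams to equal length; each time step across the streams forms
# one column of the bunch that is fed to the network in parallel.
T = max(len(s) for s in streams)
bunch = np.array([s + [0] * (T - len(s)) for s in streams])
print(bunch)                         # shape: (bunch_size, T)
```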

#25 Word pair approximation for more efficient decoding with high-order language models

Authors: David Nolden ; Ralf Schlüter ; Hermann Ney

The search effort in LVCSR depends on the order of the language model (LM); search hypotheses are only recombined once the LM allows for it. In this work we show how the LM dependence can be partially eliminated by exploiting the well-known word pair approximation. We enforce preemptive unigram- or bigram-like LM recombination at word boundaries. We capture the recombination in a lattice, and later expand the lattice using LM rescoring. LM rescoring unfolds the same search space which would have been encountered without the preemptive recombination, but the overall efficiency is improved, because the amount of redundant HMM expansion in different LM contexts is reduced. Additionally, we show how to expand the recombined hypotheses on-the-fly, omitting the intermediate lattice form. Our new approach allows using the full n-gram LM for decoding, but based on a compact unigram- or bigram search space. We show that our approach works better than common lattice rescoring pipelines, where a pruned lower-order LM is used to generate lattices; such pipelines suffer from the weak lower-order LM, which guides the pruning sub-optimally. Our new decoding approach improves the runtime efficiency by up to 40% at equal precision when using a large vocabulary and high-order LM.